Realtime Speech-to-Text API

Real-time speech-to-text streaming over a single WebSocket connection.

This endpoint is a dedicated transcription stream: you push raw audio frames in and receive incremental transcriptions and, optionally, translations back as JSON.

Step 1. Get API credentials

Get your Client ID and Client Secret from the Palabra API keys section.

Step 2. Create a session

Exchange your credentials for a short-lived token by calling POST /session-storage/session. The publisher field in the response is the token you pass when connecting.

import requests

def get_token(client_id: str, client_secret: str) -> str:
    resp = requests.post(
        "https://api.palabra.ai/session-storage/session",
        json={"data": {}},
        headers={"ClientId": client_id, "ClientSecret": client_secret},
    )
    resp.raise_for_status()
    return resp.json()["data"]["publisher"]

Step 3. Connect

Open a WebSocket to the endpoint below. The token from Step 2 must be passed as the token query parameter. All other stream settings are passed as query parameters in the same URL.

wss://api.palabra.ai/asr/v1/speech-to-text/stream?token=<TOKEN>&language=en&format=pcm_s16le&sample_rate=16000

import websockets

# `token` is the value returned by get_token() in Step 2.
url = (
    "wss://api.palabra.ai/asr/v1/speech-to-text/stream"
    f"?token={token}&language=en&format=pcm_s16le&sample_rate=16000"
)
# `await` must run inside a coroutine (an `async def` function).
ws = await websockets.connect(url)

Query parameters

| Parameter | Required | Description |
| --- | --- | --- |
| token | yes | Session token |
| format | yes | Audio format (see Audio formats) |
| sample_rate | conditional | Sample rate in Hz. Required for every raw format (PCM and G.711) except pcm_s16le, where it is required only when the rate is not 16000. Not used for container formats |
| language | no | Source language code. Defaults to auto |
| translate_languages | no | Comma-separated target languages, e.g. es,de,fr |
| enable_filler_filter | no | Whether to enable the filler filter. true by default for all languages except ja |
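
Concatenating the query string by hand gets error-prone once optional parameters are involved. The sketch below assembles the URL from the parameters in the table using the standard library's urlencode. The helper name build_stream_url and its keyword arguments are illustrative and not part of the API, and sending the literal string false to turn the filler filter off is an assumption (the table only states that the default is true).

```python
from urllib.parse import urlencode

WS_URL = "wss://api.palabra.ai/asr/v1/speech-to-text/stream"


def build_stream_url(
    token,
    audio_format="pcm_s16le",
    sample_rate=None,
    language=None,
    translate_languages=None,
    enable_filler_filter=None,
):
    """Build the connection URL; optional parameters are omitted when unset."""
    params = {"token": token, "format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = str(sample_rate)
    if language is not None:
        params["language"] = language
    if translate_languages:
        # The API expects one comma-separated value, e.g. "es,de,fr".
        params["translate_languages"] = ",".join(translate_languages)
    if enable_filler_filter is not None:
        # Assumption: the server accepts "true" / "false" literals.
        params["enable_filler_filter"] = "true" if enable_filler_filter else "false"
    # safe="," keeps the language list readable instead of encoding it as %2C.
    return f"{WS_URL}?{urlencode(params, safe=',')}"
```

For example, build_stream_url(token, sample_rate=16000, language="en", translate_languages=["es", "de"]) yields a URL that also requests Spanish and German translations.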

Step 4. Send audio

Send audio as raw binary WebSocket frames. Chunks of 320 ms are recommended.

await ws.send(data)
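
The 320 ms recommendation is stated in time, while a sender works in bytes. As a rough sketch, the conversion for raw PCM and a splitter for an in-memory buffer could look like this. The function names are illustrative, and mono audio is assumed because the page does not specify a channel count.

```python
def chunk_size_bytes(sample_rate, bytes_per_sample=2, channels=1, chunk_ms=320):
    """Bytes needed to hold chunk_ms of raw PCM audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm, size):
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

For pcm_s16le at 16 kHz this gives 5120 samples, or 10240 bytes, per frame. Each yielded chunk can then be passed to ws.send as one binary frame.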

Audio formats

| format | sample_rate | Notes |
| --- | --- | --- |
| pcm_s16le | only if ≠ 16000 | 16-bit signed little-endian PCM. Recommended |
| pcm_f32le / pcm_f32be | required | 32-bit float PCM |
| pcm_s32le / pcm_s32be | required | 32-bit signed PCM |
| mulaw / alaw | required | G.711 |
| webm / mp3 / aac / ogg / flac / wav | not used | Container formats; rate is read from the stream |
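
The table's rules can be checked on the client before connecting, so that a missing sample_rate is caught locally. This is only a sketch of one way to encode the table: the function name and the choice to raise ValueError are assumptions, not server behavior.

```python
RATE_REQUIRED_FORMATS = {
    "pcm_f32le", "pcm_f32be", "pcm_s32le", "pcm_s32be", "mulaw", "alaw",
}
CONTAINER_FORMATS = {"webm", "mp3", "aac", "ogg", "flac", "wav"}


def sample_rate_param(audio_format, sample_rate=None):
    """Return the sample_rate value to send, or None if it can be omitted."""
    if audio_format in CONTAINER_FORMATS:
        return None  # rate is read from the stream
    if audio_format == "pcm_s16le":
        # Only needed when the rate differs from 16000.
        return None if sample_rate in (None, 16000) else sample_rate
    if audio_format in RATE_REQUIRED_FORMATS:
        if sample_rate is None:
            raise ValueError(f"sample_rate is required for {audio_format}")
        return sample_rate
    raise ValueError(f"unknown audio format: {audio_format}")
```

Sending sample_rate=16000 with pcm_s16le is also fine, as the connection example in Step 3 does; the helper merely omits it when it is redundant.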

Step 5. Receive messages

All server-to-client messages are JSON text frames. Switch on message_type.

transcription

Emitted continuously as speech is recognized.

{
  "message_type": "transcription",
  "transcription_id": "a1b2c3d4",
  "language": "en",
  "is_eos": false,
  "segment": {
    "text": "Hello world how are",
    "start_time": 0.32,
    "end_time": 1.84
  },
  "delta": {
    "text": "how are",
    "start_time": 1.20,
    "end_time": 1.84
  }
}

| Field | Description |
| --- | --- |
| transcription_id | Stable id for the segment. All messages of one segment share the same id. A new id means a new segment has started |
| language | Detected (or configured) source language of this segment |
| is_eos | false: partial, the segment is still being updated. true: the segment is committed and final |
| segment.text | The full text of the segment so far |
| segment.start_time / end_time | Segment timing, in seconds relative to session start |
| delta | Incremental hint: the text added since the previous partial of the same segment (see below) |

Working with delta

When the filler filter is disabled, delta.text is append-only: each transcription message carries exactly the text appended since the previous partial, so you can concatenate deltas directly.

With the filler filter enabled, the recognizer's tail might be rewritten mid-segment, which breaks the append relationship. In that mode treat segment.text as authoritative and overwrite the current segment on each message; use delta only as a hint.
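
The overwrite strategy described above can be sketched as a small buffer that keeps committed segments and replaces the in-progress one on every message. The class name TranscriptBuffer and its methods are illustrative; only the message fields come from the API.

```python
class TranscriptBuffer:
    """Keeps final segments plus the segment currently being updated."""

    def __init__(self):
        self.final = []          # texts of committed segments, in order
        self.current_id = None   # transcription_id of the open segment
        self.current_text = ""   # latest segment.text for the open segment

    def apply(self, msg):
        if msg.get("message_type") != "transcription":
            return
        # Treat segment.text as authoritative: overwrite, never append.
        self.current_id = msg["transcription_id"]
        self.current_text = msg["segment"]["text"]
        if msg["is_eos"]:
            # The segment is committed; move it to the final list.
            self.final.append(self.current_text)
            self.current_id = None
            self.current_text = ""

    def render(self):
        parts = self.final + ([self.current_text] if self.current_text else [])
        return " ".join(parts)
```

Because every partial replaces the previous one, a rewritten tail needs no special handling. This works the same way whether or not the filler filter is enabled.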

translated_transcription

Sent only when translate_languages is set, once per target language, after each final (is_eos: true) transcription.

{
  "message_type": "translated_transcription",
  "transcription_id": "a1b2c3d4",
  "language": "es",
  "is_eos": true,
  "segment": {
    "text": "Hola mundo, ¿cómo estás?",
    "start_time": 0.32,
    "end_time": 1.84
  }
}

transcription_id matches the id of the source transcription (the is_eos: true one) this translation was produced from — use it to correlate a translation back to its original segment. language here is the target language, and is_eos is always true (translations are produced only for finalized segments).
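
One possible way to do that correlation is to index both message types by transcription_id. The sketch below is a minimal illustration; the class name TranslationIndex and its methods are hypothetical.

```python
from collections import defaultdict


class TranslationIndex:
    """Pairs each final source segment with its translations by id."""

    def __init__(self):
        self.sources = {}                      # id -> final source text
        self.translations = defaultdict(dict)  # id -> {language: text}

    def apply(self, msg):
        kind = msg.get("message_type")
        tid = msg.get("transcription_id")
        if kind == "transcription" and msg.get("is_eos"):
            # Only final segments are translated, so only those are stored.
            self.sources[tid] = msg["segment"]["text"]
        elif kind == "translated_transcription":
            # `language` is the target language of this translation.
            self.translations[tid][msg["language"]] = msg["segment"]["text"]

    def pair(self, tid):
        """Return (source_text_or_None, {language: translated_text})."""
        return self.sources.get(tid), dict(self.translations.get(tid, {}))
```

Feeding every received message to apply keeps the index current, and pair("a1b2c3d4") then returns the source text together with whatever translations have arrived for it so far.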


Errors

Authentication and routing failures are reported as HTTP status codes during the WebSocket upgrade, before the connection is established:

| HTTP status | Meaning |
| --- | --- |
| 401 | Missing or invalid token |
| 409 | A session is already active for this identity |

After a successful upgrade, the server does not send application-level error messages over the wire — it closes the connection with a standard WebSocket close frame.
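
How a rejected upgrade surfaces depends on the WebSocket client library and its version, so the sketch below takes the HTTP status as a plain integer and only maps it to the meanings in the table. The function name is illustrative, and the fallback message for other codes is an assumption.

```python
UPGRADE_ERRORS = {
    401: "Missing or invalid token",
    409: "A session is already active for this identity",
}


def describe_upgrade_failure(status):
    """Translate an upgrade-time HTTP status into a readable message."""
    meaning = UPGRADE_ERRORS.get(status)
    if meaning is None:
        # Not documented on this page; reported generically.
        return f"HTTP {status}: unexpected status during WebSocket upgrade"
    return f"HTTP {status}: {meaning}"
```

A caller would typically catch the library's handshake exception, extract the status code from it, and log the result of describe_upgrade_failure before deciding whether to request a new token.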


Complete example

Streams microphone audio and prints transcriptions. PALABRA_LANGUAGE sets only the source language; translations are printed only if you also add translate_languages to the connection URL.

pip install pyaudio websockets requests
export PALABRA_CLIENT_ID=...      # from Step 1
export PALABRA_CLIENT_SECRET=...
export PALABRA_LANGUAGE=en        # source language

import json
import os
import asyncio
import threading
import queue

import pyaudio
import requests
import websockets

WS_URL = "wss://api.palabra.ai/asr/v1/speech-to-text/stream"
SESSION_URL = "https://api.palabra.ai/session-storage/session"
LANGUAGE = os.environ.get("PALABRA_LANGUAGE", "en")

SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK = 5120 # samples ≈ 320 ms at 16 kHz (recommended chunk size)


def get_token() -> str:
    resp = requests.post(
        SESSION_URL,
        json={"data": {
            "subscriber_count": 0,
            "publisher_count": 1,
            "publisher_can_subscribe": True,
        }},
        headers={
            "ClientId": os.environ["PALABRA_CLIENT_ID"],
            "ClientSecret": os.environ["PALABRA_CLIENT_SECRET"],
        },
    )
    resp.raise_for_status()
    return resp.json()["data"]["publisher"]


def mic_reader(audio_queue: queue.Queue, stop_event: threading.Event):
    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK,
    )
    print("Microphone open, speak now...")
    try:
        while not stop_event.is_set():
            audio_queue.put(stream.read(CHUNK, exception_on_overflow=False))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()


async def stream(token: str):
    url = (
        f"{WS_URL}?token={token}&language={LANGUAGE}"
        f"&format=pcm_s16le&sample_rate={SAMPLE_RATE}"
    )

    audio_queue: queue.Queue = queue.Queue()
    stop_event = threading.Event()
    # The blocking microphone read runs in its own thread.
    threading.Thread(
        target=mic_reader, args=(audio_queue, stop_event), daemon=True
    ).start()

    async with websockets.connect(url) as ws:
        print("Connected")

        async def send_audio():
            # get_running_loop() is the supported call inside a coroutine.
            loop = asyncio.get_running_loop()
            while True:
                data = await loop.run_in_executor(None, audio_queue.get)
                await ws.send(data)  # raw binary frame

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                msg_type = msg.get("message_type")

                if msg_type == "transcription":
                    text = msg["segment"]["text"]
                    tid = msg.get("transcription_id", "")
                    if msg.get("is_eos"):
                        print(f"\n[EOS] {text}  [{tid}]")
                    else:
                        # segment.text is the source of truth; render it whole
                        print(f"\r  {text}", end="", flush=True)

                elif msg_type == "translated_transcription":
                    lang = msg.get("language", "?")
                    tid = msg.get("transcription_id", "")
                    print(f"\n[{lang}] {msg['segment']['text']}  [{tid}]")

        try:
            await asyncio.gather(send_audio(), receive())
        finally:
            stop_event.set()


if __name__ == "__main__":
    token = get_token()
    print("Session created")
    try:
        asyncio.run(stream(token))
    except KeyboardInterrupt:
        print("\nStopped.")